Abstract

We address the problem of reinforcement learning in which observations may 
exhibit an arbitrary form of stochastic dependence on past observations and 
actions. The task for an agent is to attain the best possible asymptotic reward 
where the true generating environment is unknown but belongs to a known 
countable family of environments. We find some sufficient conditions on the 
class of environments under which an agent exists which attains the best 
asymptotic reward for any environment in the class. We analyze how tight 
these conditions are and how they relate to different probabilistic assumptions 
known in reinforcement learning and related fields, such as Markov Decision 
Processes and mixing conditions.

1 Introduction



Many real-world "learning" problems (like learning to drive a car or playing a game)
can be modelled as an agent $\pi$ that interacts with an environment $\mu$ and is
(occasionally) rewarded for its behavior. We are interested in agents which perform well
in the sense of having high long-term reward, also called the value $V(\mu,\pi)$ of agent $\pi$
in environment $\mu$. If $\mu$ is known, it is a pure (non-learning) computational problem
to determine the optimal agent $\pi^\mu := \arg\max_\pi V(\mu,\pi)$. It is far less clear what an
"optimal" agent means if $\mu$ is unknown. A reasonable objective is to have a single
policy $\pi$ with high value simultaneously in many environments. We will formalize
this criterion later and call it self-optimizing.

Learning approaches in reactive worlds. Reinforcement learning, sequential
decision theory, adaptive control theory, and active expert advice are theories
dealing with this problem. They overlap but have different core focus: Reinforcement
learning algorithms [SB98] are developed to learn $\mu$ or directly its value. Temporal
difference learning is computationally very efficient, but has slow asymptotic
guarantees (only) in (effectively) small observable MDPs. Others have faster guarantees
in finite state MDPs [BT99]. There are algorithms [EDKM05] which are optimal
for any finite connected POMDP, and this is apparently the largest class of
environments considered. In sequential decision theory, a Bayes-optimal agent $\pi^\xi$ that
maximizes $V(\xi,\pi)$ is considered, where $\xi$ is a mixture of environments $\nu\in\mathcal{C}$ and $\mathcal{C}$ is
a class of environments that contains the true environment $\mu\in\mathcal{C}$ [Hut05]. Policy $\pi^\xi$
is self-optimizing in an arbitrary class $\mathcal{C}$, provided $\mathcal{C}$ allows for self-optimizingness
[Hut02]. Adaptive control theory [KV86] considers very simple (from an AI
perspective) or special systems (e.g. linear with quadratic loss function), which
sometimes allow computationally and data efficient solutions. Action with expert advice
[dFM04, PH05, PH06, CBL06] constructs an agent (called master) that performs
nearly as well as the best agent (best expert in hindsight) from some class of experts,
in any environment $\nu$. The important special case of passive sequence prediction in
arbitrary unknown environments, where the actions (predictions) do not affect the
environment, is comparably easy [Hut03, HP04].

The difficulty in active learning problems can be identified (at least, for countable
classes) with traps in the environments. Initially the agent does not know $\mu$, so it has
asymptotically to be forgiven for taking initial "wrong" actions. A well-studied such
class is that of ergodic MDPs, which guarantee that, from any action history, every state
can be (re)visited [Hut02].

What's new. The aim of this paper is to characterize classes $\mathcal{C}$, as general as
possible, in which self-optimizing behaviour is possible, more general than POMDPs. To
do this we need to characterize classes of environments that forgive. For instance,
exact state recovery is unnecessarily strong; it is sufficient to be able to recover high
rewards, from whatever states. Further, in many real world problems there is no
information available about the "states" of the environment (e.g. in POMDPs) or
the environment may exhibit long history dependencies.

Rather than trying to model an environment (e.g. by MDP) we try to identify 
the conditions sufficient for learning. Towards this aim, we propose to consider only 
environments in which, after any arbitrary finite sequence of actions, the best value is 
still achievable. The performance criterion here is asymptotic average reward. Thus 
we consider such environments for which there exists a policy whose asymptotic 
average reward exists and upper-bounds asymptotic average reward of any other 
policy. Moreover, the same property should hold after any finite sequence of actions 
has been taken (no traps). 

Yet this property in itself is not sufficient for identifying optimal behavior. We
require further that, from any sequence of $k$ actions, it is possible to return to the
optimal level of reward in $o(k)$ steps. (The above conditions will be formulated in a
probabilistic form.) Environments which possess this property are called (strongly)
value-stable.

We show that for any countable class of value-stable environments there exists a
policy which achieves the best possible value in any of the environments from the class
(i.e. is self-optimizing for this class). We also show that strong value-stability is in
a certain sense necessary.

We also consider examples of environments which possess strong value-stability. 
In particular, any ergodic MDP can be easily shown to have this property. A 
mixing-type condition which implies value-stability is also demonstrated. Finally, 
we provide a construction allowing to build examples of value-stable environments 
which are not isomorphic to a finite POMDP, thus demonstrating that the class of 
value-stable environments is quite general. 

It is important in our argument that the class of environments for which we seek 
a self-optimizing policy is countable, although the class of all value-stable environ- 
ments is uncountable. To find a set of conditions necessary and sufficient for learning 
which do not rely on countability of the class is yet an open problem. However, from 
a computational perspective countable classes are sufficiently large (e.g. the class of 
all computable probability measures is countable). 

Contents. The paper is organized as follows. Section 2 introduces the necessary
notation of the agent framework. In Section 3 we define and explain the notion of
value-stability, which is central in the paper. Section 4 presents the theorem about
self-optimizing policies for classes of value-stable environments, and illustrates the
applicability of the theorem by providing examples of strongly value-stable
environments. In Section 5 we discuss necessity of the conditions of the main theorem.
Section 6 provides some discussion of the results and an outlook to future research.
The formal proof of the main theorem is given in the appendix, while Section 4
contains only intuitive explanations.

2 Notation & Definitions



We essentially follow the notation of [Hut02, Hut05]. 

Strings and probabilities. We use letters $i,k,l,m,n \in \mathbb{N}$ for natural numbers,
and denote the cardinality of sets $S$ by $\#S$. We write $\mathcal{X}^*$ for the set of finite
strings over some alphabet $\mathcal{X}$, and $\mathcal{X}^\infty$ for the set of infinite sequences. For a string
$x \in \mathcal{X}^*$ of length $\ell(x) = n$ we write $x_1 x_2 \dots x_n$ with $x_t \in \mathcal{X}$ and further abbreviate
$x_{k:n} := x_k x_{k+1} \dots x_{n-1} x_n$ and $x_{<n} := x_1 \dots x_{n-1}$. Finally, we define
$x_{k..n} := x_k + \dots + x_n$, provided elements of $\mathcal{X}$ can be added.

We assume that the sequence $\omega = \omega_{1:\infty} \in \mathcal{X}^\infty$ is sampled from the "true" probability
measure $\mu$, i.e. $P[\omega_{1:n} = x_{1:n}] = \mu(x_{1:n})$. We denote expectations w.r.t. $\mu$ by $E$, i.e.
for a function $f : \mathcal{X}^n \to \mathbb{R}$, $E[f] = E[f(\omega_{1:n})] = \sum_{x_{1:n}} \mu(x_{1:n}) f(x_{1:n})$. When we use
probabilities and expectations with respect to other measures we make the notation
explicit, e.g. $E_\nu$ is the expectation with respect to $\nu$. Measures $\nu_1$ and $\nu_2$ are called
singular if there exists a set $A$ such that $\nu_1(A) = 0$ and $\nu_2(A) = 1$.

The agent framework is general enough to allow modelling nearly any kind of
(intelligent) system [RN95]. In cycle $k$, an agent performs action $y_k \in \mathcal{Y}$ (output) which
results in observation $o_k \in \mathcal{O}$ and reward $r_k \in \mathcal{R}$, followed by cycle $k+1$ and so on.
We assume that the action space $\mathcal{Y}$, the observation space $\mathcal{O}$, and the reward space
$\mathcal{R} \subset \mathbb{R}$ are finite, w.l.g. $\mathcal{R} = \{0,\dots,r_{\max}\}$. We abbreviate $z_k := y_k r_k o_k \in \mathcal{Z} := \mathcal{Y}\times\mathcal{R}\times\mathcal{O}$
and $x_k = r_k o_k \in \mathcal{X} := \mathcal{R}\times\mathcal{O}$. An agent is identified with a (probabilistic) policy $\pi$.
Given history $z_{<k}$, the probability that agent $\pi$ acts $y_k$ in cycle $k$ is (by definition)
$\pi(y_k|z_{<k})$. Thereafter, environment $\mu$ provides (probabilistic) reward $r_k$ and
observation $o_k$, i.e. the probability that the agent perceives $x_k$ is (by definition) $\mu(x_k|z_{<k}y_k)$.
Note that policy and environment are allowed to depend on the complete history.
We do not make any MDP or POMDP assumption here, and we don't talk about
states of the environment, only about observations. Each (policy, environment) pair
$(\pi,\mu)$ generates an I/O sequence $z_1^{\pi\mu} z_2^{\pi\mu} \dots$. Mathematically, history $z_{1:k}^{\pi\mu}$ is a random
variable with probability

$$P[z_{1:k}^{\pi\mu} = z_{1:k}] = \pi(y_1)\cdot\mu(x_1|y_1)\cdot\ \dots\ \cdot\pi(y_k|z_{<k})\cdot\mu(x_k|z_{<k}y_k).$$

Since value optimizing policies can always be chosen deterministic, there is no real
need to consider probabilistic policies, and henceforth we consider deterministic
policies $p$. We assume that $\mu\in\mathcal{C}$ is the true, but unknown, environment, and $\nu\in\mathcal{C}$
a generic environment.
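The interaction protocol above can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions: the names `interact`, `history_probability`, `always_a` and `sticky_env` are hypothetical, a policy is encoded as a function of the history $z_{<k}$, and an environment as a function returning a finite distribution over percepts $x_k = (r_k, o_k)$. The toy environment is only an illustration of history dependence and is not taken from the text.

```python
import random
from typing import Callable, List, Tuple

Cycle = Tuple[str, int, str]            # z_k = (y_k, r_k, o_k)
History = List[Cycle]
Percept = Tuple[int, str]               # x_k = (r_k, o_k)
Policy = Callable[[History], str]       # deterministic policy p(z_<k) -> y_k
# Hypothetical encoding: environment(z_<k, y_k) -> list of (probability, percept)
Environment = Callable[[History, str], List[Tuple[float, Percept]]]


def interact(policy: Policy, env: Environment, steps: int, rng: random.Random) -> History:
    """Run the agent-environment loop for `steps` cycles and return z_{1:steps}."""
    history: History = []
    for _ in range(steps):
        action = policy(history)
        dist = env(history, action)
        percepts = [x for _, x in dist]
        probs = [p for p, _ in dist]
        reward, obs = rng.choices(percepts, weights=probs, k=1)[0]
        history.append((action, reward, obs))
    return history


def history_probability(policy: Policy, env: Environment, history: History) -> float:
    """P[z_{1:k}] for a deterministic policy: 0 if an action deviates from the
    policy, otherwise the product of mu(x_i | z_{<i} y_i)."""
    prob = 1.0
    for i, (action, reward, obs) in enumerate(history):
        prefix = history[:i]
        if policy(prefix) != action:
            return 0.0
        prob *= sum(p for p, x in env(prefix, action) if x == (reward, obs))
    return prob


def always_a(history: History) -> str:
    """Toy policy: always play action 'a'."""
    return "a"


def sticky_env(history: History, action: str) -> List[Tuple[float, Percept]]:
    """Toy environment (an assumption for illustration): reward 1 has probability
    0.8 if the previous reward was 1 (or at the start), and 0.2 otherwise."""
    p_one = 0.8 if not history or history[-1][1] == 1 else 0.2
    return [(p_one, (1, "")), (1.0 - p_one, (0, ""))]
```

For instance, `history_probability(always_a, sticky_env, [("a", 1, ""), ("a", 1, ""), ("a", 0, "")])` multiplies 0.8, 0.8 and 0.2, matching the product formula for a deterministic policy.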

3 Setup 

For an environment $\nu$ and a policy $p$ define random variables (upper and lower
average value)

$$\overline{V}(\nu,p) := \limsup_m \left\{ \tfrac{1}{m} r^{p\nu}_{1..m} \right\} \quad\text{and}\quad \underline{V}(\nu,p) := \liminf_m \left\{ \tfrac{1}{m} r^{p\nu}_{1..m} \right\},$$

where $r_{1..m} := r_1 + \dots + r_m$. If there exists a constant $V$ such that

$$\overline{V}(\nu,p) = \underline{V}(\nu,p) = V \quad \text{a.s.}$$

then we say that the limiting average value exists and denote it by $V(\nu,p) := V$.

An environment $\nu$ is explorable if there exists a policy $p_\nu$ such that $V(\nu,p_\nu)$
exists and $\overline{V}(\nu,p) \le V(\nu,p_\nu)$ with probability 1 for every policy $p$. In this case define
$V^*_\nu := V(\nu,p_\nu)$.

A policy $p$ is self-optimizing for a set of environments $\mathcal{C}$ if $V(\nu,p) = V^*_\nu$ for every
$\nu\in\mathcal{C}$.
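The upper and lower average values are limits and cannot be computed from a finite run, but their behaviour can be illustrated on a finite prefix. The sketch below (hypothetical helper names `running_averages` and `tail_range`) computes the averages $\frac{1}{m} r_{1..m}$ and the range they span over a tail of the run, as a finite proxy for the gap between limsup and liminf. The block-structured reward sequence is an assumption chosen so that the averages keep oscillating.

```python
from typing import List, Tuple


def running_averages(rewards: List[float]) -> List[float]:
    """Return the sequence (1/m) * r_{1..m} for m = 1, ..., len(rewards)."""
    total = 0.0
    averages = []
    for m, r in enumerate(rewards, start=1):
        total += r
        averages.append(total / m)
    return averages


def tail_range(averages: List[float], start: int) -> Tuple[float, float]:
    """Min and max of the averages from index `start` on: a finite-sample
    stand-in for (liminf, limsup), not the limits themselves."""
    tail = averages[start:]
    return min(tail), max(tail)


# Rewards arranged in blocks of doubling length, alternating between 1 and 0.
# The running average keeps swinging, so no limiting average value exists.
oscillating: List[float] = []
for j in range(12):
    oscillating.extend([1.0 if j % 2 == 0 else 0.0] * (2 ** j))

low, high = tail_range(running_averages(oscillating), len(oscillating) // 2)
```

For a constant reward sequence the two numbers coincide, which corresponds to the case where the limiting average value exists.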

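Definition 1 (value-stable environments). An explorable environment $\nu$ is
(strongly) value-stable if there exist a sequence of numbers $r_i^\nu \in [0,r_{\max}]$ and two
functions $d_\nu(k,\varepsilon)$ and $\varphi_\nu(n,\varepsilon)$ such that $\frac{1}{n} r^\nu_{1..n} \to V^*_\nu$, $d_\nu(k,\varepsilon) = o(k)$, $\sum_{n=1}^\infty \varphi_\nu(n,\varepsilon) < \infty$
for every fixed $\varepsilon$, and for every $k$ and every history $z_{<k}$ there exists a policy $p = p^{z_{<k}}_\nu$
such that

$$P\left( r^\nu_{k..k+n} - r^{p\nu}_{k..k+n} > d_\nu(k,\varepsilon) + n\varepsilon \mid z_{<k} \right) \le \varphi_\nu(n,\varepsilon). \qquad (1)$$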

First of all, this condition means that the strong law of large numbers for
rewards holds uniformly over histories $z_{<k}$; the numbers $r_i^\nu$ here can be thought of
as expected rewards of an optimal policy. Furthermore, from any (bad) sequence
of $k$ actions it is possible (knowing the environment) to recover up to $o(k)$ reward
loss; to recover means to reach the level of reward obtained by the optimal policy
which from the beginning was taking only optimal actions. That is, suppose that
a person A has made $k$ possibly suboptimal actions and after that "realized" what
the true environment was and how to act optimally in it. Suppose that a person
B was from the beginning taking only optimal actions. We want to compare the
performance of A and B on the first $n$ steps after step $k$. An environment is strongly
value-stable if A can catch up with B except for $o(k)$ gain. The numbers $r_i^\nu$ can be
thought of as expected rewards of B; A can catch up with B up to the reward loss
$d_\nu(k,\varepsilon) + n\varepsilon$, failing with probability at most $\varphi_\nu(n,\varepsilon)$, where the latter does not
depend on past actions and observations (the law of large numbers holds uniformly).
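As a sanity check of condition (1), consider the simplest possible case, introduced here only as an illustration and not taken from the text: an environment with a single action whose rewards are i.i.d. Bernoulli with parameter $\delta$. Then no recovery is ever needed, and the summable rate comes directly from Hoeffding's inequality applied to the $n+1$ independent rewards $r_k,\dots,r_{k+n}$:

```latex
r_i^\nu := \delta, \qquad V^*_\nu = \delta, \qquad d_\nu(k,\varepsilon) := 0,
\\[4pt]
P\Big( r^{\nu}_{k..k+n} - r^{p\nu}_{k..k+n} > n\varepsilon \;\Big|\; z_{<k} \Big)
  = P\Big( \sum_{i=k}^{k+n} (\delta - r_i) > n\varepsilon \Big)
  \le \exp\Big( -\frac{2 n^2 \varepsilon^2}{n+1} \Big)
  \le e^{-n\varepsilon^2} =: \varphi_\nu(n,\varepsilon) \quad (n \ge 1),
\\[4pt]
\sum_{n \ge 1} e^{-n\varepsilon^2} < \infty \quad \text{for every fixed } \varepsilon > 0 .
```

The second inequality uses $2n \ge n+1$ for $n \ge 1$. Since the rewards are independent of the past, the bound holds uniformly over histories $z_{<k}$, which is exactly the uniformity the definition asks for.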

In the next section, after presenting the main theorem, we consider examples of
families of strongly value-stable environments.

4 Main Results 

In this section we present the main self-optimizingness result along with an informal
explanation of its proof, and illustrate the applicability of this result with examples
of classes of value-stable environments.

Theorem 2 (value-stable $\Rightarrow$ self-optimizing). For any countable class $\mathcal{C}$ of
strongly value-stable environments, there exists a policy which is self-optimizing for
$\mathcal{C}$.

A formal proof is given in the appendix; here we give some intuitive justification.
Suppose that all environments in $\mathcal{C}$ are deterministic. We will construct a
self-optimizing policy $p$ as follows: Let $\nu^t$ be the first environment in $\mathcal{C}$. The algorithm
assumes that the true environment is $\nu^t$ and tries to get $\varepsilon$-close to its optimal value
for some (small) $\varepsilon$. This is called an exploitation part. If it succeeds, it does some
exploration as follows. It picks the first environment $\nu^e$ which has higher average
asymptotic value than $\nu^t$ ($V^*_{\nu^e} > V^*_{\nu^t}$) and tries to get $\varepsilon$-close to this value acting
optimally under $\nu^e$. If it can not get close to the $\nu^e$-optimal value then $\nu^e$ is not
the true environment, and the next environment can be picked for exploration (here
we call "exploration" successive attempts to exploit an environment which differs
from the current hypothesis about the true environment and has a higher average
reward). If it can, then it switches to exploitation of $\nu^t$, exploits it until it is $\varepsilon'$-close
to $V^*_{\nu^t}$, $\varepsilon' < \varepsilon$, and switches to $\nu^e$ again, this time trying to get $\varepsilon'$-close to $V^*_{\nu^e}$; and
so on. This can happen only a finite number of times if the true environment is $\nu^t$,
since $V^*_{\nu^t} < V^*_{\nu^e}$. Thus after exploration either $\nu^t$ or $\nu^e$ is found to be inconsistent with
the current history. If it is $\nu^e$ then just the next environment $\nu^e$ such that $V^*_{\nu^e} > V^*_{\nu^t}$
is picked for exploration. If it is $\nu^t$ then the first consistent environment is picked
for exploitation (and denoted $\nu^t$). This in turn can happen only a finite number of
times before the true environment $\nu$ is picked as $\nu^t$. After this, the algorithm still
continues its exploration attempts, but can always keep within $\varepsilon_k \to 0$ of the optimal
value. This is ensured by $d(k) = o(k)$.
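One ingredient of the deterministic case, picking the first environment that is consistent with the current history, is easy to state in code. The sketch below covers only this hypothesis-selection step, not the full exploration and exploitation schedule. The encoding of a deterministic environment as a function from the actions taken so far to the current reward, and the names `is_consistent` and `first_consistent`, are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical encoding: a deterministic environment maps the actions y_1..y_i
# (including the current one) to the reward r_i of the current step.
DetEnv = Callable[[Sequence[str]], int]


def is_consistent(env: DetEnv, actions: Sequence[str], rewards: Sequence[int]) -> bool:
    """True if `env` would have produced exactly the observed rewards."""
    return all(env(actions[: i + 1]) == rewards[i] for i in range(len(actions)))


def first_consistent(envs: List[DetEnv], actions: Sequence[str],
                     rewards: Sequence[int]) -> Optional[int]:
    """Index of the first environment in the enumeration that is consistent
    with the history, or None if every listed environment is refuted."""
    for index, env in enumerate(envs):
        if is_consistent(env, actions, rewards):
            return index
    return None


# Two toy environments: the first rewards action 'a', the second rewards 'b'.
envs: List[DetEnv] = [
    lambda acts: 1 if acts[-1] == "a" else 0,
    lambda acts: 1 if acts[-1] == "b" else 0,
]
```

With an empty history every environment is consistent, so index 0 is returned; after observing reward 0 for action `a`, the first environment is refuted and the selection moves to index 1.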

The probabilistic case is somewhat more complicated since we can not say
whether an environment is "consistent" with the current history. Instead we test
each environment for consistency as follows. Let $\xi$ be a mixture of all environments
in $\mathcal{C}$. Observe that together with some fixed policy each environment $\mu$ can be
considered as a measure on $\mathcal{Z}^\infty$. Moreover, it can be shown that (for any fixed policy)
the ratio $\frac{\nu(z_{<n})}{\xi(z_{<n})}$ is bounded away from zero if $\nu$ is the true environment $\mu$ and tends
to zero if $\nu$ is singular with $\mu$ (in fact, here singularity is a probabilistic analogue
of inconsistency). The exploration part of the algorithm ensures that at least one
of the environments $\nu^t$ and $\nu^e$ is singular with the true environment $\nu$ on the current
history, and a succession of tests $\frac{\nu(z_{<n})}{\xi(z_{<n})} \ge \alpha_s$ with $\alpha_s \to 0$ is used to exclude such
environments from consideration.
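The behaviour of the ratio of an environment to the mixture can be observed numerically in a toy setting. The sketch below assumes, purely for illustration, a class of three passive environments that emit i.i.d. binary rewards with different biases, and uniform mixture weights; the function names are hypothetical. It computes $\nu(z)/\xi(z)$ in log space to avoid underflow.

```python
import math
import random
from typing import List


def log_likelihood(bias: float, flips: List[int]) -> float:
    """log nu(z) for an i.i.d. Bernoulli(bias) source."""
    return sum(math.log(bias if x == 1 else 1.0 - bias) for x in flips)


def ratios_to_mixture(biases: List[float], weights: List[float],
                      flips: List[int]) -> List[float]:
    """Return nu(z) / xi(z) for each candidate nu, where
    xi(z) = sum_nu w_nu * nu(z) is the mixture."""
    logs = [log_likelihood(b, flips) for b in biases]
    top = max(logs)
    # log xi(z), computed stably by factoring out the largest term.
    log_mix = top + math.log(sum(w * math.exp(l - top) for w, l in zip(weights, logs)))
    return [math.exp(l - log_mix) for l in logs]


rng = random.Random(0)
biases = [0.3, 0.5, 0.7]                 # candidate environments (assumed)
weights = [1.0 / 3, 1.0 / 3, 1.0 / 3]    # mixture weights w_nu (assumed)
flips = [1 if rng.random() < 0.7 else 0 for _ in range(2000)]  # true bias 0.7
ratios = ratios_to_mixture(biases, weights, flips)
```

In this run the ratio of the true environment stays of order one (it can never exceed $1/w_\nu$), while the ratios of the two wrong environments become vanishingly small, so any threshold test of the form "ratio $\ge \alpha_s$" with a fixed positive threshold eventually excludes them.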

The next proposition provides some conditions on mixing rates which are suf- 
ficient for value-stability; we do not intend to provide sharp conditions on mixing 
rates but rather to illustrate the relation of value-stability with mixing conditions. 

We say that a stochastic process $h_k$, $k\in\mathbb{N}$, satisfies strong $\alpha$-mixing conditions
with coefficients $\alpha(k)$ if (see e.g. [Bos96])

$$\sup_{n\in\mathbb{N}}\ \sup_{B\in\sigma(h_1,\dots,h_n),\ C\in\sigma(h_{n+k},\dots)} |P(B\cap C) - P(B)P(C)| \le \alpha(k),$$

where $\sigma()$ stands for the sigma-algebra generated by the random variables in
brackets. Loosely speaking, mixing coefficients $\alpha$ reflect the speed with which the process
"forgets" about its past.

Proposition 3 (mixing conditions). Suppose that an explorable environment $\nu$
is such that there exist a sequence of numbers $r_i^\nu$ and a function $d(k)$ such that
$\frac{1}{n} r^\nu_{1..n} \to V^*_\nu$, $d(k) = o(k)$, and for each $z_{<k}$ there exists a policy $p$ such that the

sequence r^^ satisfies strong a-mixing conditions with coefficients a{k) — for 
some £>0 and 

$$r^\nu_{k..k+n} - E\left( r^{p\nu}_{k..k+n} \mid z_{<k} \right) \le d(k)$$
for any $n$. Then $\nu$ is value-stable.

Proof. Using the union bound we obtain

$$P\left( r^\nu_{k..k+n} - r^{p\nu}_{k..k+n} > d(k) + n\varepsilon \right) \le I\left( r^\nu_{k..k+n} - E\, r^{p\nu}_{k..k+n} > d(k) \right) + P\left( |r^{p\nu}_{k..k+n} - E\, r^{p\nu}_{k..k+n}| > n\varepsilon \right).$$

The first term equals 0 by assumption and the second term for each $\varepsilon$ can be shown
to be summable using [Bos96, Thm.1.3]: for a sequence of uniformly bounded
zero-mean random variables $r_i$ satisfying strong $\alpha$-mixing conditions the following bound
holds true for any integer $q \in [1, n/2]$:

P(|ri..n| >ne)<ce-' -|- cqa 

£ 

for some constant c; in our case we just set q — n^. □ □ 

(PO)MDPs. Applicability of Theorem 2 and Proposition 3 can be illustrated
on (PO)MDPs. We note that self-optimizing policies for (uncountable) classes of
finite ergodic MDPs and POMDPs are known [BT99, EDKM05]; the aim of the
present section is to show that value-stability is a weaker requirement than the
requirements of these models, and also to illustrate applicability of our results. We
call $\mu$ a (stationary) Markov decision process (MDP) if the probability of perceiving
$x_k \in \mathcal{X}$, given history $z_{<k}y_k$, only depends on $y_k \in \mathcal{Y}$ and $x_{k-1}$. In this case $x_k \in \mathcal{X}$ is
called a state, $\mathcal{X}$ the state space. An MDP is called ergodic if there exists a policy
under which every state is visited infinitely often with probability 1. An MDP with
a stationary policy forms a Markov chain.

An environment is called a (finite) partially observable MDP (POMDP) if there
is a sequence of random variables $s_k$ taking values in a finite space $\mathcal{S}$ called the state
space, such that $x_k$ depends only on $s_k$ and $y_k$, and $s_{k+1}$ is independent of $s_{<k}$ given
$s_k$. Abusing notation the sequence $s_{1:k}$ is called the underlying Markov chain. A
POMDP is called ergodic if there exists a policy such that the underlying Markov
chain visits each state infinitely often with probability 1.

In particular, any ergodic POMDP $\nu$ satisfies strong $\alpha$-mixing conditions with
coefficients decaying exponentially fast in case there is a set $H \subset \mathcal{R}$ such that
$\nu(r_i \in H) = 1$ and $\nu(r_i = r \mid s_i = s, y_i = y) \ne 0$ for each $y\in\mathcal{Y}$, $s\in\mathcal{S}$, $r\in H$, $i\in\mathbb{N}$. Thus for any
such POMDP $\nu$ we can use Proposition 3 with $d(k,\varepsilon)$ a constant function to show
that $\nu$ is strongly value-stable:

Corollary 4 (POMDP $\Rightarrow$ value-stable). Suppose that a POMDP $\nu$ is ergodic and
there exists a set $H \subset \mathcal{R}$ such that $\nu(r_i \in H) = 1$ and $\nu(r_i = r \mid s_i = s, y_i = y) \ne 0$ for each
$y\in\mathcal{Y}$, $s\in\mathcal{S}$, $r\in H$, where $\mathcal{S}$ is the finite state space of the underlying Markov chain.
Then $\nu$ is strongly value-stable.

However, it is illustrative to obtain this result for MDPs directly, and in a slightly
stronger form.

Proposition 5 (MDP $\Rightarrow$ value-stable). Any finite-state ergodic MDP $\nu$ is a
strongly value-stable environment.

Proof. Let $d(k,\varepsilon) = 0$. Denote by $\mu$ the true environment, let $z_{<k}$ be the current
history and let the current state (the observation $x_k$) of the environment be $a \in \mathcal{X}$,
where $\mathcal{X}$ is the set of all possible states. Observe that for an MDP there is an
optimal policy which depends only on the current state. Moreover, such a policy is
optimal for any history. Let $p_\mu$ be such a policy. Let $r_i^\mu$ be the expected reward of
$p_\mu$ on step $i$. Let $l(a,b) = \min\{n : x_{k+n} = b \mid x_k = a\}$. By ergodicity of $\mu$ there exists a
policy $p$ for which $E\,l(b,a)$ is finite (and does not depend on $k$). A policy $p$ needs to
get from the state $b$ to one of the states visited by an optimal policy, and then acts
according to p^. Let /(n) :— "^g^ - We have 

P (kfc..fe+n - r^.k+n] > ne) < supP (|E {rl'^^k+nl^k = a) - r^f^^+^l > ne)) 



< sup P{l{a,b) > f{n)/r^^) 

a,heX 



+ sup P ( E [r'^i^^W = a) - rltfin)..k+n > " /(^) 

a,b&X ^ I ^ 



^k+f{n) 



< sup P{l{a,b) > /(n)/rniax) 



+ sup P (|E {rl%Jx, = a) - rl%^\ > ne - 2/(n) x, = a) . 

In the last term we have the deviation of the reward attained by the optimal policy
from its expectation. Clearly, both terms are bounded exponentially in $n$. □

In the examples above the function $d(k,\varepsilon)$ is a constant and $\varphi(n,\varepsilon)$ decays
exponentially fast. This suggests that the class of value-stable environments stretches
beyond finite (PO)MDPs. We illustrate this guess by the construction that follows.

An example of a value-stable environment: infinitely armed bandit. Next we
present a construction of environments which can not be modelled as finite POMDPs
but are value-stable. Consider the following environment $\nu$. There is a countable
family $\{\zeta_i : i \in \mathbb{N}\}$ of arms, that is, sources generating i.i.d. rewards 0 and 1
(and, say, empty observations) with some probability $\delta_i$ of the reward being 1. The
action space $\mathcal{Y}$ consists of three actions $\mathcal{Y} = \{g,u,d\}$. To get the next reward from
the current arm $\zeta_i$ an agent can use the action $g$. At the beginning the current arm
is $\zeta_0$ and then the agent can move between arms as follows: it can move one arm
"up" using the action $u$ or move "down" to the first arm using the action
$d$. The reward for actions $u$ and $d$ is 0.

Clearly, $\nu$ is a POMDP with a countably infinite number of states in the underlying
Markov chain, which (in general) is not isomorphic to a finite POMDP.
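The construction is simple enough to simulate directly. The following is a minimal sketch of the infinitely armed bandit as described above; the class name `InfiniteArmedBandit`, the use of a callable `delta` for the success probabilities $\delta_i$, and the explicit random generator argument are choices made here for illustration.

```python
import random
from typing import Callable


class InfiniteArmedBandit:
    """Arms 0, 1, 2, ... where arm i pays reward 1 with probability delta(i).

    Actions: 'g' draws a reward from the current arm, 'u' moves one arm up,
    'd' moves back down to the first arm (arm 0). Moves yield reward 0.
    """

    def __init__(self, delta: Callable[[int], float], rng: random.Random) -> None:
        self.delta = delta
        self.rng = rng
        self.arm = 0  # the agent starts at the first arm

    def step(self, action: str) -> int:
        """Apply one action and return the reward of this cycle."""
        if action == "g":
            return 1 if self.rng.random() < self.delta(self.arm) else 0
        if action == "u":
            self.arm += 1
            return 0
        if action == "d":
            self.arm = 0
            return 0
        raise ValueError(f"unknown action: {action!r}")
```

For example, with `delta = lambda i: 1.0 - 1.0 / (i + 2)` the supremum of the success probabilities is 1 but no single arm attains it, so a good policy has to keep climbing while spending most of its time pulling.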

Claim 6. The environment $\nu$ just constructed is value-stable.

Proof. Let $\delta = \sup_{i\in\mathbb{N}} \delta_i$. Clearly, $\overline{V}(\nu,p') \le \delta$ with probability 1 for any policy $p'$.
A policy $p$ which, knowing all the probabilities $\delta_i$, achieves $\overline{V}(\nu,p) = \underline{V}(\nu,p) = \delta =: V^*_\nu$
a.s., can be easily constructed. Indeed, find a sequence of arms $\zeta'_j$, $j\in\mathbb{N}$, where for each $j$
there is $i =: i_j$ such that $\zeta'_j = \zeta_i$, satisfying $\lim_{j\to\infty} \delta_{i_j} = \delta$. The policy $p$ should carefully
exploit the arms $\zeta'_j$ one by one, staying with each arm long enough to ensure that
the average reward is close to the expected reward up to an error probability $\varepsilon_j$, where $\varepsilon_j$
quickly tends to 0, and so that switching between arms has a negligible impact on
the average reward. Thus $\nu$ can be shown to be explorable. Moreover, a policy $p$
just sketched can be made independent of (observations and) rewards.

Furthermore, one can modify the policy $p$ (possibly allowing it to exploit each
arm longer) so that on each time step $t$ (from some $t$ on) we have $j(t) < \sqrt{t}$, where
$j(t)$ is the number of the current arm on step $t$. Thus, after any actions-perceptions
history $z_{<k}$ one needs about $\sqrt{k}$ actions (at most one action $d$ and enough actions $u$) to catch
up with $p$. So, (1) can be shown to hold with $d(k,\varepsilon) = \sqrt{k}$, $r_i$ the expected reward of
$p$ on step $i$ (since $p$ is independent of rewards, $r_i^{p\nu}$ are independent), and the rates
$\varphi(n,\varepsilon)$ exponential in $n$. □

In the above construction we can also allow the action $d$ to bring the agent
$d(i) < i$ steps down, where $i$ is the number of the current arm, according
to some (possibly randomized) function $d(i)$, thus changing the function $d_\nu(k,\varepsilon)$ and
possibly making it non-constant in $\varepsilon$ and as close as desirable to linear.

5 Necessity of value-stability 

Now we turn to the question of how tight the conditions of strong value-stability
are. The following proposition shows that the requirement $d(k,\varepsilon) = o(k)$ in (1) can
not be relaxed.

Proposition 7 (necessity of $d(k,\varepsilon) = o(k)$). There exists a countable family of
deterministic explorable environments $\mathcal{C}$ such that

• for any $\nu\in\mathcal{C}$ and for any sequence of actions $y_{<k}$ there exists a policy $p$ such that

$$r^\nu_{k..k+n} = r^{p\nu}_{k..k+n} \quad \text{for all } n \ge k,$$

where $r_i^\nu$ are the rewards attained by an optimal policy $p_\nu$ (which from the
beginning was acting optimally), but

• for any policy $p$ there exists an environment $\nu\in\mathcal{C}$ such that $\underline{V}(\nu,p) < V^*_\nu$.

Clearly, each environment from such a class $\mathcal{C}$ satisfies the value-stability
conditions with $\varphi(n,\varepsilon) \equiv 0$, except that $d(k,\varepsilon) = k \ne o(k)$.

Proof. There are two possible actions $y_i \in \{a,b\}$, three possible rewards $r_i \in \{0,1,2\}$
and no observations.

Construct the environment $\nu_0$ as follows: if $y_i = a$ then $r_i = 1$ and if $y_i = b$ then
$r_i = 0$ for any $i \in \mathbb{N}$.

For each $i$ let $n_i$ denote the number of actions $a$ taken up to step $i$: $n_i := \#\{j \le
i : y_j = a\}$. For each $s > 0$ construct the environment $\nu_s$ as follows: $r_i(a) = 1$ for any
$i$, $r_i(b) = 2$ if the longest consecutive sequence of action $b$ taken has length greater
than $n_i$ and $n_i \ge s$; otherwise $r_i(b) = 0$.

Suppose that there exists a policy $p$ such that $\underline{V}(\nu_i,p) = V^*_{\nu_i}$ for each $i > 0$ and let
the true environment be $\nu_0$. By assumption, for each $s$ there exists such $n$ that

$$\#\{i \le n : y_i = b, r_i = 0\} > s > \#\{i \le n : y_i = a, r_i = 1\}$$

which implies $\underline{V}(\nu_0,p) \le 1/2 < 1 = V^*_{\nu_0}$. □
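The environments used in this proof are deterministic and can be written down explicitly. The sketch below is an illustration with hypothetical function names; it assumes that the "longest consecutive sequence of action $b$" and the count $n_i$ both include the current step, which the text leaves implicit. Passing `s = 0` gives $\nu_0$.

```python
from typing import Sequence


def longest_run(actions: Sequence[str], symbol: str) -> int:
    """Length of the longest block of consecutive `symbol` in `actions`."""
    best = current = 0
    for y in actions:
        current = current + 1 if y == symbol else 0
        best = max(best, current)
    return best


def reward(s: int, actions: Sequence[str]) -> int:
    """Reward r_i of environment nu_s after the actions y_1..y_i.

    nu_0: action 'a' pays 1, action 'b' pays 0.
    nu_s (s > 0): 'a' pays 1; 'b' pays 2 once the longest run of 'b' exceeds
    the number n_i of 'a' actions taken so far and n_i >= s, else 0.
    """
    if actions[-1] == "a":
        return 1
    if s == 0:
        return 0
    n_i = sum(1 for y in actions if y == "a")
    if longest_run(actions, "b") > n_i and n_i >= s:
        return 2
    return 0
```

The trap is visible in the signature of the reward: to find out whether some $\nu_s$ with the high reward 2 is the true environment, a policy must play `b` more times in a row than it has played `a`, and in $\nu_0$ every one of those steps earns nothing.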

It is also easy to show that the uniformity of convergence in (1) can not be
dropped. That is, if in the definition of value-stability we allow the function $\varphi(n,\varepsilon)$
to depend additionally on the past history $z_{<k}$ then Theorem 2 does not hold. This
can be shown with the same example as constructed in the proof of Proposition 7,
letting $d(k,\varepsilon) = 0$ but instead allowing $\varphi(n,\varepsilon,z_{<k})$ to take values 0 and 1 according
to the number of actions $a$ taken, achieving the same behaviour as in the example
provided in the last proof.

Finally, we show that the requirement that the class $\mathcal{C}$ to be learnt is
countable can not be easily withdrawn. Indeed, consider the following simple class of
environments. An environment is called passive if the observations and rewards are
independent of actions. The sequence prediction task is a well-studied (and perhaps
the only reasonable) class of passive environments: in this task an agent gets the
reward 1 if $y_i = o_{i+1}$ and the reward 0 otherwise. Clearly, any deterministic passive
environment $\nu$ is strongly value-stable with $d_\nu(k,\varepsilon) = 1$, $\varphi_\nu(n,\varepsilon) = 0$ and $r_i^\nu = 1$ for
all $i$. Obviously, the class of all deterministic passive environments is not countable.
Since for every policy $p$ there is an environment on which $p$ errs on every
step, we obtain the following.

Claim 8. The class of all deterministic passive environments can not be learned.
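The diagonal argument behind Claim 8 can be carried out mechanically for any given deterministic predictor. The sketch below assumes a binary observation alphabet and encodes a predictor as a function from the observations seen so far to its next guess; both choices, and the function names, are illustrative assumptions.

```python
from typing import Callable, List

# Hypothetical encoding: a deterministic predictor maps o_1..o_i to a guess for o_{i+1}.
Predictor = Callable[[List[int]], int]


def adversarial_sequence(predictor: Predictor, length: int) -> List[int]:
    """Build the binary sequence on which `predictor` is wrong at every step."""
    observations: List[int] = []
    for _ in range(length):
        guess = predictor(observations)
        observations.append(1 - guess)  # always emit the other symbol
    return observations


def total_reward(predictor: Predictor, observations: List[int]) -> int:
    """Number of steps on which the prediction matches the next observation."""
    return sum(
        1 for i in range(len(observations))
        if predictor(observations[:i]) == observations[i]
    )


def repeat_last(past: List[int]) -> int:
    """Toy predictor: guess that the last observation repeats (0 at the start)."""
    return past[-1] if past else 0
```

Since the constructed sequence is itself a deterministic passive environment, every fixed policy has some member of the class on which it earns reward 0 forever, while the optimal value in that environment is 1.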

6 Discussion 

We have proposed a set of conditions on environments, called value-stability, such
that any countable class of value-stable environments admits a self-optimizing policy.
It was also shown that these conditions are in a certain sense tight. The class of all
value-stable environments includes ergodic MDPs, a certain class of finite POMDPs,
passive environments, and (provably) other and more general environments. So the
concept of value-stability allows us to characterize self-optimizing environment classes,
and proving value-stability is typically much easier than proving self-optimizingness
directly.

We considered only countable environment classes C. From a computational 
perspective such classes are sufficiently large (e.g. the class of all computable proba- 
bility measures is countable). On the other hand, countability excludes continuously 
parameterized families (like all ergodic MDPs), common in statistical practice. So 
perhaps the main open problem is to find under which conditions the requirement of 
countability of the class can be lifted. Ideally, we would like to have some necessary 
and sufficient conditions such that the class of all environments that satisfy this 
condition admits a self-optimizing policy. 

Another question concerns the uniformity of forgetfulness of the environment.
Currently in the definition of value-stability (1) we have the function $\varphi(n,\varepsilon)$ which
is the same for all histories $z_{<k}$, that is, both for all action histories $y_{<k}$ and
observation-reward histories $x_{<k}$. Probably it is possible to differentiate between
two types of forgetfulness, one for actions and one for perceptions. In particular, any
countable class of passive environments (i.e. such that perceptions are independent
of actions) is learnable, suggesting that uniform forgetfulness in perceptions may
not be necessary.

A Proof of Theorem 2 

A self-optimizing policy $p$ will be constructed as follows. On each step we will have
two policies: $p^t$ which exploits and $p^e$ which explores; for each $i$ the policy $p$ either
takes an action according to $p^t$ ($p(z_{<i}) = p^t(z_{<i})$) or according to $p^e$ ($p(z_{<i}) = p^e(z_{<i})$),
as will be specified below. When the policy $p$ has been defined up to a step $k$, each
environment $\mu$, endowed with this policy, can be considered as a measure on $\mathcal{Z}^k$. We
assume this meaning when we use environments as measures on $\mathcal{Z}^k$ (e.g. $\mu(z_{<i})$).

In the algorithm below, i denotes the number of the current step in the sequence 
of actions-observations. Let n — 1, s — 1, and j^=j'^ — 0. Let also — for sElN. 
For each environment find such a sequence of real numbers £^ that £^ — > and 



Let $i : \mathbb{N} \to \mathcal{C}$ be such a numbering that each $\nu\in\mathcal{C}$ has infinitely many indices.
For all $i > 1$ define a measure $\xi$ as follows

$$\xi(z_{<i}) = \sum_{\nu\in\mathcal{C}} w_\nu\, \nu(z_{<i}),$$

where $w_\nu \in \mathbb{R}$ are (any) such numbers that $\sum_\nu w_\nu = 1$ and $w_\nu > 0$ for all $\nu\in\mathcal{C}$.
Define $T$. On each step $i$ let

Define $\nu^t$. Set $\nu^t$ to be the first environment in $T$ with index greater than

In case this is impossible (that is, if $T$ is empty), increment $s$, (re)define $T$ and try

again. Increment $j^t$.

Define $\nu^e$. Set $\nu^e$ to be the first environment with index greater than such
that $V^*_{\nu^e} > V^*_{\nu^t}$ and $\nu^e(z_{<k}) > 0$, if such an environment exists. Otherwise proceed one
step (according to $p^t$) and try again. Increment $j^e$.

Consistency. On each step $i$ (re)define $T$. If $\nu^t \notin T$, define $\nu^t$, increment $s$ and
iterate the infinite loop. (Thus $s$ is incremented only if $\nu^t$ is not in $T$ or if $T$ is
empty.)

Start the infinite loop. Increment n. 

Let 5:^{V*e-V*t)/2. Let If £<5 set 5^e. Let h^f. 

Prepare for exploration.

Increment $h$. The index $h$ is incremented with each next attempt of exploring
$\nu^e$. Each attempt will be at least $h$ steps in length.
Let $p^t = p^{z_{<i}}_{\nu^t}$ and set $p = p^t$.
Let $i_h$ be the current step. Find $k_1$ such that

< e/S (2) 



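Find $k_2 > 2i_h$ such that for all $m > k_2$

$$\left| \frac{1}{m-i_h}\, r^{\nu^t}_{i_h+1..m} - V^*_{\nu^t} \right| \le \varepsilon/8. \qquad (3)$$

Find $k_3$ such that

$$h\, r_{\max}/k_3 \le \varepsilon/8. \qquad (4)$$

Find $k_4$ such that for all $m > k_4$

$$\frac{1}{m} d_{\nu^e}(m,\varepsilon/4) \le \varepsilon/8, \quad \frac{1}{m} d_{\nu^t}(m,\varepsilon/8) \le \varepsilon/8 \quad\text{and}\quad \frac{1}{m} d_{\nu^t}(i_h,\varepsilon/8) \le \varepsilon/8. \qquad (5)$$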
Moreover, it is always possible to find such /c>max{A;i,A;2,^3,^4} that 

^<3. > ^ri.3k + (6) 

Iterate up to the step $k$.
Exploration. Set $p^e = p^{z_{<k}}_{\nu^e}$. Iterate $h$ steps according to $p = p^e$. Iterate further
until either of the following conditions breaks

(i) $|r^{\nu^e}_{k..i} - r^{p\nu}_{k..i}| < (i-k)\varepsilon/4 + d_{\nu^e}(k,\varepsilon/4)$,






(ii) $i < 3k$,

(iii) $\nu^e \in T$.

Observe that either (i) or (ii) is necessarily broken.

If on some step $\nu^t$ is excluded from $T$ then the infinite loop is iterated. If after
exploration $\nu^e$ is not in $T$ then redefine $\nu^e$ and iterate the infinite loop. If both
$\nu^t$ and $\nu^e$ are still in $T$ then return to "Prepare for exploration" (otherwise the loop
is iterated with either $\nu^t$ or $\nu^e$ changed).
End of the infinite loop and the algorithm.

Let us show that with probability 1 the "Exploration" part is iterated only a
finite number of times in a row with the same $\nu^t$ and $\nu^e$.

Suppose the contrary, that is, suppose that (with some non-zero probability)
the "Exploration" part is iterated infinitely often while $\nu^t,\nu^e \in T$. Observe that (1)
implies that the $\nu^e$-probability that (i) breaks is not greater than $\varphi_{\nu^e}(i-k,\varepsilon/4)$; hence
by the Borel-Cantelli lemma the event that (i) breaks infinitely often has probability 0
under $\nu^e$.

Suppose that (i) holds almost every time. Then (ii) should be broken except 
for a finite number of times. We can use (2), (3), (5) and (6) to show that with 
probability at least l — ip^t(k — ih,e/4:) under we have ■^i"l''.^j^>V*t+e/2. Again 
using Borcl-Cantelli lemma and k > 2ih we obtain that the event that (ii) breaks 
infinitely often has probability under z/*. 

Thus (at least) one of the environments $\nu^t$ and $\nu^e$ is singular with respect to
the true environment $\nu$ given the described policy and current history. Denote this
environment by $\nu'$. It is known (see e.g. [CS04, Thm.26]) that if measures $\mu$ and $\nu$
are mutually singular then $\frac{\mu(x_1,\dots,x_n)}{\nu(x_1,\dots,x_n)} \to \infty$ $\mu$-a.s. Thus

$$\frac{\nu'(z_{<i})}{\nu(z_{<i})} \to 0 \quad \nu\text{-a.s.} \qquad (7)$$

Observe that (by definition of $\xi$) $\frac{\nu(z_{<i})}{\xi(z_{<i})}$ is bounded. Hence using (7) we can see that

$$\frac{\nu'(z_{<i})}{\xi(z_{<i})} \to 0 \quad \nu\text{-a.s.}$$

Since $s$ and $\alpha_s$ are not changed during the exploration phase this implies that on
some step $\nu'$ will be excluded from $T$ according to the "consistency" condition, which
contradicts the assumption. Thus the "Exploration" part is iterated only a finite
number of times in a row with the same $\nu^t$ and $\nu^e$.

Observe that $s$ is incremented only a finite number of times since $\frac{\nu'(z_{<i})}{\xi(z_{<i})}$ is
bounded away from 0, where $\nu'$ is either the true environment $\nu$ or any
environment from $\mathcal{C}$ which is equivalent to $\nu$ on the current history. The latter follows from
the fact that $\frac{\xi(z_{<i})}{\nu(z_{<i})}$ is a submartingale with bounded expectation, and hence, by the
submartingale convergence theorem (see e.g. [Doo53]) converges with $\nu$-probability
1.

Let us show that from some step on $\nu$ (or an environment equivalent to it) is
always in $T$ and selected as $\nu^t$. Consider the environment $\nu^t$ on some step $i$. If
$V^*_{\nu^t} > V^*_\nu$ then $\nu^t$ will be excluded from $T$ since on any sequence of actions (policy)
optimal for $\nu^t$ the measures $\nu$ and $\nu^t$ are singular. If $V^*_{\nu^t} < V^*_\nu$ then $\nu^e$ will be equal
to $\nu$ at some point, and, after this happens a sufficient number of times, $\nu^t$ will be
excluded from $T$ by the "exploration" part of the algorithm, $s$ will be incremented
and $\nu$ will be included into $T$. Finally, if $V^*_{\nu^t} = V^*_\nu$ then either the optimal value $V^*_\nu$ is
(asymptotically) attained by the policy $p^t$ of the algorithm, or (if $p_{\nu^t}$ is suboptimal
for $\nu$) $\frac{1}{i} r^{p\nu}_{1..i} < V^*_{\nu^t} - \varepsilon$ infinitely often for some $\varepsilon$, which has probability 0 under $\nu^t$ and
consequently $\nu^t$ is excluded from $T$.

Thus, the exploration part ensures that all environments not equivalent to $\nu$ with
indices smaller than $i(\nu)$ are removed from $T$ and so from some step on $\nu^t$ is equal
to (an environment equivalent to) the true environment $\nu$.

We have shown in the "Exploration" part that $n \to \infty$, and so $\varepsilon_n^{\nu^t} \to 0$. Finally,
using the same argument as before (Borel-Cantelli lemma, (i) and the definition of
$k$) we can show that in the "exploration" and "prepare for exploration" parts of the
algorithm the average value is within $\varepsilon_n^{\nu^t}$ of $V^*_{\nu^t}$ provided the true environment is
(equivalent to) $\nu^t$. □